Linear Regression Project is my own exercise project from the Udemy course Python for Data Science and Machine Learning Bootcamp by Jose Portilla. In this project, I analyze a dummy ecommerce dataset to help the company decide whether to focus its efforts on its mobile app experience or its website. I work in Python (Anaconda distribution) using Jupyter Notebook.
Project Intro/Objective
The purpose of this project is to use linear regression to decide whether the company should focus its efforts on its mobile app experience or its website.
Project Library
- Numpy
- Pandas
- Matplotlib
- Seaborn
- scikit-learn (sklearn)
Data and Setup
In this section, I show some basic information about the data: the data frame head, info, and description.
The customers data contains customer information such as Email, Address, and their color Avatar. It also has numerical value columns:
- Avg. Session Length: Average session of in-store style advice sessions.
- Time on App: Average time spent on App in minutes.
- Time on Website: Average time spent on Website in minutes.
- Length of Membership: How many years the customer has been a member.
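The inspection steps described above (head, info, description) can be sketched as follows. Since the actual CSV file is not reproduced here, the sketch builds a small synthetic stand-in with the same numerical column names; the random values, the seed, and the made-up weights in the target column are assumptions for illustration only, and the text columns (Email, Address, Avatar) are omitted.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
# Made-up linear target plus noise so the later steps have something to fit.
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

print(customers.head())      # first five rows
customers.info()             # column dtypes and non-null counts
print(customers.describe())  # summary statistics for the numerical columns
```

In the notebook itself, customers would come from reading the course's CSV file with pd.read_csv rather than from random numbers.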
Exploratory Data Analysis
In this section, I explore the numerical data. In my opinion, the numerical values contribute more to understanding customer behaviour than alphanumeric data such as email, address, and avatar.
Jointplot
In this section, I analyze the numerical data using jointplots, which show the relationship between two variables together with the distribution of each. First, I analyze the relationship between Time on Website and Yearly Amount Spent.
Second, I analyze the relationship between Time on App and Yearly Amount Spent.
Third, I analyze the relationship between Time on App/Website and Length of Membership.
All of the jointplots above show roughly normal distributions. Customers who spend an average amount of time on the website or app spend more than customers who spend too little or too much time. Likewise, customers with an average length of membership spend more time on the app than very new or very long-standing members.
Next, I look at the relationships across the entire dataset, using a pairplot.
Based on the pairplot above, I can see that Length of Membership is the feature most correlated with Yearly Amount Spent. To see this relationship more clearly, I plot these two columns using lmplot.
Linear Regression Model
Training and Testing Data
In this section, I split the data into training and testing sets. My target is the Yearly Amount Spent column, which I assign to the variable y; I then set the variable X to the numerical feature columns of the customers data.
I use model_selection.train_test_split from sklearn to split the data into training and testing sets, with test_size=0.3 and random_state=101.
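The split described above can be sketched as follows. The test_size and random_state values come from the text; the data is a synthetic stand-in (random values and made-up weights, not the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X = customers[features]               # numerical features
y = customers["Yearly Amount Spent"]  # target

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
```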
Training the Model
In this section, I import LinearRegression from sklearn, create an instance called lm, and fit it on X_train and y_train. The fitted model exposes its coefficients as lm.coef_; I interpret these coefficients later.
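A minimal sketch of the training step follows. It runs on synthetic stand-in data (random values and made-up weights), so the printed coefficients are not the ones reported in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["Yearly Amount Spent"], test_size=0.3, random_state=101
)

lm = LinearRegression()   # ordinary least squares model
lm.fit(X_train, y_train)  # learn one coefficient per feature plus an intercept

print("Coefficients:", lm.coef_)
print("Intercept:", lm.intercept_)
```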
Predicting Test Data
After fitting the model, I predict the yearly amount spent for the test set and compare the predicted values to the real values using a scatter plot.
From the scatter plot above, I can see that the model predicts the values well, because the predictions are almost identical to the real values.
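The prediction and comparison step could look roughly like this. It is a sketch on synthetic stand-in data (random values and made-up weights); the axis labels are my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["Yearly Amount Spent"], test_size=0.3, random_state=101
)
lm = LinearRegression().fit(X_train, y_train)

# Predict on rows the model has not seen during training.
predictions = lm.predict(X_test)

# Points close to the diagonal mean predictions close to the real values.
fig, ax = plt.subplots()
ax.scatter(y_test, predictions)
ax.set_xlabel("Real Yearly Amount Spent")
ax.set_ylabel("Predicted Yearly Amount Spent")
```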
Evaluating the Model
To evaluate the model, I calculate the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error. The result is good enough because the error is low on all three metrics.
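The three metrics named above can be computed with sklearn.metrics as sketched below. The data is a synthetic stand-in (random values and made-up weights), so the printed errors are not the project's actual results.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["Yearly Amount Spent"], test_size=0.3, random_state=101
)
predictions = LinearRegression().fit(X_train, y_train).predict(X_test)

mae = metrics.mean_absolute_error(y_test, predictions)  # mean of |error|
mse = metrics.mean_squared_error(y_test, predictions)   # mean of error squared
rmse = np.sqrt(mse)                                     # same units as the target

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Because MAE and RMSE are in the same units as Yearly Amount Spent (dollars), they are easiest to judge against the typical size of the target.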
Residuals
To make sure everything is okay with the data, I plot a histogram of the residuals and check whether they are normally distributed.
The plot above shows that the residuals are normally distributed, which confirms that the model is good enough to predict the yearly amount spent. For further analysis, I interpret the coefficient values.
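A residual histogram could be drawn roughly as follows, taking the residual to be the real value minus the predicted value. This is a sketch on synthetic stand-in data (random values and made-up weights); the bin count of 30 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["Yearly Amount Spent"], test_size=0.3, random_state=101
)
predictions = LinearRegression().fit(X_train, y_train).predict(X_test)

# Residual = real value minus predicted value, one per test row.
residuals = y_test - predictions

# A roughly bell-shaped histogram centred near zero is the pattern to look for.
fig, ax = plt.subplots()
ax.hist(residuals, bins=30)
ax.set_xlabel("Residual (real - predicted)")
ax.set_ylabel("Count")
```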
Conclusions
To make the coefficients above easier to read, I put them in a new dataframe.
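One way to build such a dataframe is sketched below, pairing each value in lm.coef_ with its feature name. The column label "Coefficient" is my own choice, and the data is a synthetic stand-in (random values and made-up weights), so the values differ from the ones interpreted in this section.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the customers data: random values, NOT the real dataset.
rng = np.random.default_rng(0)
features = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
customers = pd.DataFrame(rng.normal([33, 12, 37, 3.5], 1.0, size=(200, 4)), columns=features)
customers["Yearly Amount Spent"] = (
    customers[features].to_numpy() @ np.array([26, 39, 0.2, 61]) - 1050 + rng.normal(0, 10, 200)
)

X = customers[features]
y = customers["Yearly Amount Spent"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
lm = LinearRegression().fit(X_train, y_train)

# lm.coef_ follows the column order of X, so X.columns can label the rows.
coefficients = pd.DataFrame(lm.coef_, index=X.columns, columns=["Coefficient"])
print(coefficients)
```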
Interpreting the coefficients:
- Holding all other features fixed, a 1 unit increase in Avg. Session Length is associated with an increase of 25.98 total dollars spent.
- Holding all other features fixed, a 1 unit increase in Time on App is associated with an increase of 38.59 total dollars spent.
- Holding all other features fixed, a 1 unit increase in Time on Website is associated with an increase of 0.19 total dollars spent.
- Holding all other features fixed, a 1 unit increase in Length of Membership is associated with an increase of 61.27 total dollars spent.
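To make the interpretation concrete, the arithmetic can be written out in a few lines. The coefficient values below are copied from the list above; the helper name predicted_change is hypothetical and only illustrates how a linear model turns feature changes into a predicted change in spending.

```python
# Coefficients as reported above (dollars per 1-unit increase in each feature).
coefficients = {
    "Avg. Session Length": 25.98,
    "Time on App": 38.59,
    "Time on Website": 0.19,
    "Length of Membership": 61.27,
}

def predicted_change(feature_deltas):
    """Predicted change in Yearly Amount Spent for the given feature changes,
    holding every feature not listed fixed (hypothetical helper)."""
    return sum(coefficients[name] * delta for name, delta in feature_deltas.items())

app_gain = predicted_change({"Time on App": 1})
web_gain = predicted_change({"Time on Website": 1})
print(f"+1 unit of Time on App:     {app_gain:.2f} dollars")
print(f"+1 unit of Time on Website: {web_gain:.2f} dollars")
```

Because the model is linear, changes add up: the predicted effect of changing several features at once is the sum of the individual effects.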
From all of our data visualization and our linear regression model, there are two factors to consider:
- From the jointplot, there are more people using the App than the Website.
- From the coefficients, people who use the App tend to spend more money than people who use the Website.
Considering the two factors above, if the company wants to increase its profit significantly, it can develop the app further. But if the company wants to increase its profit gradually, it can develop the website to catch up to the performance of the mobile app.
Additional Resources
- Header Backgrounds by CardMapr at unsplash.com
- For further explanation regarding python code, please kindly check this link.